Backpropagation is a well-known algorithm for computing the gradient of an error function. 
It applies when target values are known and the loss, cost or error function is one-dimensional. 
It is therefore important to generalize it to a full gradient calculation for cases where we seek 
the maximum or minimum value of a neural network's output (though this is often ill-fated because 
of the local optima produced by an overfitted neural network). This is needed, for example, when 
implementing certain reinforcement learning methods. 


Consider a two-layer neural network 

$$y(x) = f(W^{(2)} g(W^{(1)} x + b^{(1)}) + b^{(2)})$$


The gradients of the final layer are (non-zero terms are at the j:th row): 
The chain rule of differentiation can be used to calculate the gradients of the second (and deeper) layers: 

By analysing the chain rule we can derive a generic backpropagation formula for the full gradient. Let 
$v^{(k)}$ be the k:th layer's local field, $v^{(k)} = W^{(k)} f(v^{(k-1)}) + b^{(k)}$. Then the local gradient matrices $\delta^{(k)}$ are 

And the network's parameter gradient matrices for each layer are (only the j:th element of each row is 
non-zero): 
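
For reference, the generic formulas can be written out as follows. This is a reconstruction sketch from the chain rule, not necessarily the layout of the original equations. It assumes an $L$-layer network with output $y = f(v^{(L)})$, local fields $v^{(k)} = W^{(k)} f(v^{(k-1)}) + b^{(k)}$, the convention $f(v^{(0)}) = x$, and an elementwise activation $f$.

```latex
\begin{align}
\delta^{(L)} &= \frac{\partial y}{\partial v^{(L)}} = \mathrm{diag}\!\left(f'(v^{(L)})\right), &
\delta^{(k-1)} &= \delta^{(k)}\, W^{(k)}\, \mathrm{diag}\!\left(f'(v^{(k-1)})\right), \\
\frac{\partial y}{\partial b^{(k)}} &= \delta^{(k)}, &
\frac{\partial y}{\partial W^{(k)}_{ji}} &= \delta^{(k)}_{:,j}\, f(v^{(k-1)})_i .
\end{align}
```

Here $\delta^{(k)}_{:,j}$ denotes the j:th column of $\delta^{(k)}$, so a weight in row $j$ of $W^{(k)}$ influences the output only through the j:th component of the local field.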
To test that the gradient matrix is computed correctly, it can be compared with the normal squared 
error calculation (normal backpropagation). 

$$e(x|w) = \frac{1}{2} \| y_i - y(x|w) \|^2$$
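
Such a test can be sketched numerically. The example below is a minimal sketch, not the author's implementation: it assumes a tanh activation in every layer, small random layer sizes, and hypothetical helper names (`forward`, `output_jacobian`, `flatten`, `unflatten`, `squared_error`). It builds the full gradient matrix of the output with respect to all parameters, multiplies it by the residual to obtain the squared-error gradient, and compares the result with central finite differences.

```python
import numpy as np

def forward(params, x):
    """Return local fields and activations of a tanh MLP; params is a list of (W, b)."""
    acts, fields = [x], []
    for W, b in params:
        v = W @ acts[-1] + b
        fields.append(v)
        acts.append(np.tanh(v))
    return fields, acts

def output_jacobian(params, x):
    """Gradient matrix dy/dtheta; theta is (W row-major, then b) layer by layer."""
    fields, acts = forward(params, x)
    dtanh = lambda v: 1.0 - np.tanh(v) ** 2
    delta = np.diag(dtanh(fields[-1]))              # dy/dv for the last layer
    blocks = []
    for k in range(len(params) - 1, -1, -1):
        W, _ = params[k]
        # dy/dW[j, i] = delta[:, j] * (layer input)[i], flattened row-major over (j, i)
        dW = np.einsum("oj,i->oji", delta, acts[k]).reshape(delta.shape[0], -1)
        blocks.append(np.hstack([dW, delta]))       # dy/db = delta
        if k > 0:
            delta = delta @ W @ np.diag(dtanh(fields[k - 1]))
    return np.hstack(blocks[::-1])

def flatten(params):
    return np.concatenate([np.concatenate([W.ravel(), b]) for W, b in params])

def unflatten(theta, shapes):
    params, pos = [], 0
    for rows, cols in shapes:
        W = theta[pos:pos + rows * cols].reshape(rows, cols)
        pos += rows * cols
        b = theta[pos:pos + rows]
        pos += rows
        params.append((W, b))
    return params

def squared_error(theta, shapes, x, target):
    y = forward(unflatten(theta, shapes), x)[1][-1]
    return 0.5 * np.sum((target - y) ** 2)

rng = np.random.default_rng(0)
shapes = [(4, 3), (2, 4)]                           # assumed layer sizes
params = [(rng.normal(size=s), rng.normal(size=s[0])) for s in shapes]
x, target = rng.normal(size=3), rng.normal(size=2)

theta = flatten(params)
y = forward(params, x)[1][-1]
jacobian = output_jacobian(params, x)
grad_from_jacobian = (y - target) @ jacobian        # de/dtheta = (y - y_i)^T dy/dtheta

eps = 1e-6
grad_numeric = np.array([
    (squared_error(theta + eps * e, shapes, x, target)
     - squared_error(theta - eps * e, shapes, x, target)) / (2 * eps)
    for e in np.eye(theta.size)
])
max_abs_diff = np.max(np.abs(grad_from_jacobian - grad_numeric))
print("largest deviation from finite differences:", max_abs_diff)
```

If the gradient matrix is correct, the printed deviation should be at the level of finite-difference noise.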


Sometimes one also needs the gradient with respect to the input x rather than the weight parameters w. 
This can be calculated using the chain rule again. For simplicity, let's consider the two-layer case first. 

$$g(x) = f(W^{(2)} h(W^{(1)} x + b^{(1)}) + b^{(2)})$$

The gradient is: 
This results in the following formula (the diag() entries are square matrices in which only the diagonal is non-zero): 


$$\nabla_x g(x|w) = \mathrm{diag}(\nabla_{v^{(L)}} f(v^{(L)}))\, W^{(L)} \cdots \mathrm{diag}(\nabla_{v^{(2)}} f(v^{(2)}))\, W^{(2)}\, \mathrm{diag}(\nabla_{v^{(1)}} f(v^{(1)}))\, W^{(1)}$$
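
The product form of the input gradient can be checked with a short numerical sketch. The code assumes a tanh activation in every layer and arbitrary small layer sizes; `input_jacobian` and `network` are hypothetical names chosen for this illustration.

```python
import numpy as np

def input_jacobian(weights, biases, x):
    """dg/dx as the product diag(f'(v^(L))) W^(L) ... diag(f'(v^(1))) W^(1)."""
    jac = np.eye(x.size)
    a = x
    for W, b in zip(weights, biases):
        v = W @ a + b                           # local field of this layer
        a = np.tanh(v)
        jac = np.diag(1.0 - a ** 2) @ W @ jac   # left-multiply by diag(f'(v)) W
    return jac

def network(weights, biases, x):
    a = x
    for W, b in zip(weights, biases):
        a = np.tanh(W @ a + b)
    return a

rng = np.random.default_rng(1)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(2, 5))]
biases = [rng.normal(size=5), rng.normal(size=2)]
x = rng.normal(size=3)

analytic = input_jacobian(weights, biases, x)
eps = 1e-6
numeric = np.column_stack([
    (network(weights, biases, x + eps * e) - network(weights, biases, x - eps * e)) / (2 * eps)
    for e in np.eye(x.size)
])
max_abs_diff = np.max(np.abs(analytic - numeric))
print("largest deviation from finite differences:", max_abs_diff)
```

The loop accumulates the product from the first layer outwards, so the same function works for any number of layers.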


In continuous reinforcement learning, we need to maximize the average Q-value $Q(x, \mu(x))$ of the 
given policy $\mu$. Its gradient can be computed using the chain rule, but there are additionally 
linear pre- and postprocessing steps $Wx + b$ in $\mu$ and $Q$, which make the calculation of the 
gradient more complicated. 

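$$\nabla_w \left[ W'_Q\, Q(W_Q z + b_Q) + b'_Q \right], \quad z = [x,\; W'_\mu\, \mu(W_\mu x + b_\mu) + b'_\mu]$$

$$= W'_Q\, \nabla Q(W_Q z + b_Q)\, W_Q\, W'_\mu\, \nabla\mu$$

But in practice we don't have post-processing for $\mu$, so the gradient becomes 

$$W'_Q\, \nabla Q(W_Q z + b_Q)\, W_Q\, \nabla\mu$$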


Recurrent Neural Networks and Backpropagation (Similar to RTRL) 

The basic learning algorithm for recurrent neural networks (RNN) is BPTT (backpropagation through 
time), but I use a modified RTRL (real-time recurrent learning) instead. This is done by unfolding 
the neural network in time and computing the gradients. The recurrent neural network is 

$$u(n+1) = \begin{pmatrix} y(n+1) \\ r(n+1) \end{pmatrix} = f(x(n), r(n))$$

The error function to minimize is: 

$$E(N) = \frac{1}{2} \sum_{n=1}^{N} \| d(n+1) - \Gamma_y f(x(n), \Gamma_r u(n-1) | w) \|^2$$

Here the $\Gamma$ matrices are used to select the $y(n)$ and $r(n)$ vectors from the generic output 
vector, and the initial input to the feedforward neural network is zero, $u(0) = 0$. 

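It is possible to calculate the gradient of $E(N)$ using the chain rule 

$$\frac{\partial E(N)}{\partial w} = \sum_{n=1}^{N} \left( \Gamma_y f(x(n), \Gamma_r u(n-1) | w) - d(n+1) \right)^T \Gamma_y \nabla_w f(x(n), \Gamma_r u(n-1) | w)$$

To calculate the gradient $\nabla_w f$ one must remember that $u(n)$ now also depends on $w$, 
resulting in the equation: 

$$\nabla_w f(x(n), \Gamma_r u(n-1)) = \frac{\partial f}{\partial w} + \frac{\partial f}{\partial r} \Gamma_r \frac{\partial u(n-1)}{\partial w}$$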

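To compute the gradients further, we get a generic update rule 

$$\frac{\partial u(n)}{\partial w} = \frac{\partial f}{\partial w} + \frac{\partial f}{\partial r} \Gamma_r \frac{\partial u(n-1)}{\partial w}$$

The computation of gradients can therefore be bootstrapped by setting $\partial u(0)/\partial w = 0$ 
and iteratively updating the gradient of $u$ while computing the current error for each time step. 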
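
The update rule can be illustrated with a small numerical sketch. It is not the author's implementation. It assumes a single-layer recurrent network $u(n) = \tanh(W [x(n); r(n-1)] + b)$ with $w = (W, b)$, takes $\Gamma_y$ and $\Gamma_r$ to select the first `n_y` and the remaining components of $u$, and pairs the target `ds[n]` with the output produced from `xs[n]`. The name `rtrl_gradient` is hypothetical.

```python
import numpy as np

def rtrl_gradient(W, b, xs, ds, n_y):
    """Error and dE/dw for u(n) = tanh(W [x(n); r(n-1)] + b), w = (W row-major, b)."""
    n_u = W.shape[0]
    n_x = xs.shape[1]
    u = np.zeros(n_u)                           # u(0) = 0
    J = np.zeros((n_u, W.size + n_u))           # du(0)/dw = 0
    error, grad = 0.0, np.zeros(W.size + n_u)
    for x, d in zip(xs, ds):
        z = np.concatenate([x, u[n_y:]])        # r(n-1) = Gamma_r u(n-1)
        u = np.tanh(W @ z + b)
        D = np.diag(1.0 - u ** 2)               # diag(f'(v))
        # direct term df/dw: dv_j/dW[j, i] = z[i] and dv/db = I
        direct = D @ np.hstack([np.kron(np.eye(n_u), z[None, :]), np.eye(n_u)])
        # recurrent term (df/dr) Gamma_r du(n-1)/dw
        J = direct + D @ W[:, n_x:] @ J[n_y:, :]
        residual = u[:n_y] - d                  # Gamma_y u(n) - d
        error += 0.5 * residual @ residual
        grad += residual @ J[:n_y, :]           # accumulate the error gradient
    return error, grad

rng = np.random.default_rng(2)
n_x, n_y, n_r, steps = 3, 2, 4, 6               # assumed sizes
W = 0.5 * rng.normal(size=(n_y + n_r, n_x + n_r))
b = 0.1 * rng.normal(size=n_y + n_r)
xs, ds = rng.normal(size=(steps, n_x)), rng.normal(size=(steps, n_y))

error, grad = rtrl_gradient(W, b, xs, ds, n_y)

def error_at(w):
    return rtrl_gradient(w[:W.size].reshape(W.shape), w[W.size:], xs, ds, n_y)[0]

w0 = np.concatenate([W.ravel(), b])
eps = 1e-6
numeric = np.array([(error_at(w0 + eps * e) - error_at(w0 - eps * e)) / (2 * eps)
                    for e in np.eye(w0.size)])
max_abs_diff = np.max(np.abs(grad - numeric))
print("largest deviation from finite differences:", max_abs_diff)
```

The gradient matrix `J` is carried forward in time alongside the state, so no unrolled history needs to be stored. The cost is that `J` has one column per weight.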

RNN-RBM 

RNN-RBM was described on the web as a way to learn to create “music” using BPTT, but I use the 
extended RTRL approach instead. 
(http://danshiebler.com/2016-08-17-musical-tensorflow-part-two-the-rnn-rbm/) 

In RNN-RBM we have a standard RBM model, but the RBM's biases $a(n+1)$ and $b(n+1)$ are 
generated by a recurrent neural network, 

$$\begin{pmatrix} y(n+1) \\ r(n+1) \end{pmatrix} = f(x(n), r(n) | w), \qquad y(n+1) = \begin{pmatrix} a(n+1) \\ b(n+1) \end{pmatrix}$$

and the visible units (MIDI notes) are fed in as the inputs of the recurrent neural network, $x(n) = v(n)$. 


One can then compute the gradient of the RBM's log-likelihood with respect to the recurrent neural 
network's weights w, maximizing the probability of the “semi”-independent MIDI note observations 


$$-\log[p(v(1), v(2), \ldots, v(N))] \approx \sum_n -\log(p(v_n))$$

We want to calculate the gradient with respect to w, where only the elements a and b depend on w. 
This can be rewritten using the free energy, $p(v) = \frac{1}{Z} e^{-F(v)} = \frac{1}{Z} \sum_h e^{-E(v,h)}$, and the gradient formula 
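
For reference, the standard free-energy form of this gradient is sketched below. It is stated from the definition $p(v) = e^{-F(v)}/Z$ and is not necessarily the exact formula of the original notes. Here $\theta$ stands for any RBM parameter, and the expectation over the model distribution is what contrastive divergence approximates with samples.

```latex
\begin{align}
-\frac{\partial \log p(v)}{\partial \theta}
  &= \frac{\partial F(v)}{\partial \theta}
   - \mathbb{E}_{v' \sim p}\!\left[ \frac{\partial F(v')}{\partial \theta} \right], \\
-\frac{\partial \log p(v(n))}{\partial w}
  &= -\frac{\partial \log p(v(n))}{\partial a(n)} \frac{\partial a(n)}{\partial w}
     -\frac{\partial \log p(v(n))}{\partial b(n)} \frac{\partial b(n)}{\partial w} .
\end{align}
```

The second line is the chain rule through the biases, which are the only RBM parameters that depend on the recurrent network's weights w.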
We assume the GB-RBM model (see RBM_notes.tm) with the following energy function 

$$E_{gb}(v, h) = \frac{1}{2} (v - a)^T \Sigma^{-1} (v - a) - (\Sigma^{-0.5} v)^T W h - b^T h$$

And we extend the RNN to also output/predict the variance, $z(n) = \log(\mathrm{diag}(\Sigma(n)))$ [here 
we use $z(n)$ for two different things: the variance parameter of the GB-RBM and the recurrent neural 
network's output]. Our gradients therefore are (see RBM_notes.tm): 

When using these gradients it is important to remember that, in general, one must update the variance 
terms $z$ independently from the other parameters or the GB-RBM doesn't converge. Initially, however, 
I will attempt to learn the variance $z$ together with $a$ and $b$, because they now all depend on the 
common weight vector w. Pseudocode for RNN-RBM optimization: 

1. Randomly initialize W and w using small values. 

2. Calculate the parameters a, b and z using the current RNN (initially with zero recurrent input 
parameters, including the previous step's visible MIDI notes v(n-1)). 

3. Use the RBM and CD-k to calculate the (v, h) parameters for the input sample(s), and calculate 
contrastive divergence samples in order to compute the gradient of the free energy $\nabla F$. 

4. Calculate the gradient of the recurrent neural network $\nabla_w u(n)$ and use $\nabla F$ to calculate 
the gradient of the probability p(v) with respect to W and w. 

5. Repeat steps 2-4 for each song i (time series) of visible notes $\{v_i(n)\}$ and use the sum of all 
songs' gradients to move the parameters W and w towards a (hopefully) higher probability of the data. 
(IMPLEMENTATION NOTE: concatenate all songs into a single time series and try to learn it.) 

BB-RBM 

Instead of the GB-RBM, whose variance learning is complicated, I initially (and also) implement and 
test a BB-RBM as the RBM part of the RNN-RBM. In this case there are no variance 
terms $z_i$ to worry about. 


$$\frac{\partial F}{\partial b} = -\mathrm{sigmoid}(Wv + b)$$

$$\frac{\partial F}{\partial W} = -\mathrm{sigmoid}(Wv + b)\, v^T$$
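
These free-energy gradients can be verified numerically. The sketch below assumes the standard Bernoulli-Bernoulli free energy $F(v) = -a^T v - \sum_j \log(1 + e^{(Wv + b)_j})$, which is not written out in these notes, and it also returns $\partial F / \partial a = -v$, which follows from that assumed form. The function names are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def free_energy(v, a, b, W):
    """Assumed BB-RBM free energy: F(v) = -a^T v - sum_j log(1 + exp((W v + b)_j))."""
    return -a @ v - np.sum(np.logaddexp(0.0, W @ v + b))

def free_energy_gradients(v, a, b, W):
    """Return dF/da, dF/db and dF/dW for one visible vector v."""
    h = sigmoid(W @ v + b)
    return -v, -h, -np.outer(h, v)

rng = np.random.default_rng(3)
n_visible, n_hidden = 6, 4                      # assumed sizes
v = rng.integers(0, 2, size=n_visible).astype(float)
a, b = rng.normal(size=n_visible), rng.normal(size=n_hidden)
W = rng.normal(size=(n_hidden, n_visible))

dF_da, dF_db, dF_dW = free_energy_gradients(v, a, b, W)

# Central finite differences for the bias b and the weight matrix W.
eps = 1e-6
num_db = np.array([
    (free_energy(v, a, b + eps * e, W) - free_energy(v, a, b - eps * e, W)) / (2 * eps)
    for e in np.eye(n_hidden)
])
num_dW = np.zeros_like(W)
for j in range(n_hidden):
    for i in range(n_visible):
        step = np.zeros_like(W)
        step[j, i] = eps
        num_dW[j, i] = (free_energy(v, a, b, W + step)
                        - free_energy(v, a, b, W - step)) / (2 * eps)

print("max deviation (b):", np.max(np.abs(dF_db - num_db)))
print("max deviation (W):", np.max(np.abs(dF_dW - num_dW)))
```

In the RNN-RBM setting, `dF_da` and `dF_db` are the quantities that get chained with the recurrent network's gradient, since only the biases depend on w.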


NOTE: initial use of the RNN-RBM outlined here seems to diverge quickly into chaos (many random 
notes played at once) when the RNN-RBM is applied to classical MIDI note data (note range: C-4 .. 
B-6). It seems the problem should be regularized somehow to limit the number of notes that are on at once. 

In practice it seems to be difficult to regularize the RNN-RBM because of the special form of the error 
function (log probability). I suggest the use of a “negative gradient”. In addition to the training 
samples whose probability should be maximized, artificial songs are created in which each note has a 
random probability p > 0.50 of being in the on position. The gradient of these additional training 
samples is calculated normally, but it is subtracted from the positive gradient so that the 
probability of those “random songs” is greatly reduced. 
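
The “negative gradient” idea can be sketched as follows. This is only an illustration under stated assumptions: songs are binary piano rolls of shape (time steps, notes), each note's on-probability is drawn uniformly between 0.5 and 1.0, `grad_fn` is a hypothetical stand-in for whatever routine computes the log-probability gradient of one song, and the `weight` factor on the negative term is an added assumption.

```python
import numpy as np

def random_songs(n_songs, n_steps, n_notes, rng, p_low=0.5, p_high=1.0):
    """Binary piano rolls; every note gets its own on-probability in [p_low, p_high)."""
    p_on = rng.uniform(p_low, p_high, size=(n_songs, 1, n_notes))
    return (rng.uniform(size=(n_songs, n_steps, n_notes)) < p_on).astype(float)

def combined_gradient(grad_fn, data_songs, noise_songs, weight=1.0):
    """Gradient of the data songs minus the weighted gradient of the random songs."""
    positive = sum(grad_fn(song) for song in data_songs)
    negative = sum(grad_fn(song) for song in noise_songs)
    return positive - weight * negative

rng = np.random.default_rng(4)
noise = random_songs(n_songs=20, n_steps=50, n_notes=12, rng=rng)

# Toy stand-ins: sparse "real" songs and a placeholder gradient (mean note activity).
data = (rng.uniform(size=(20, 50, 12)) < 0.1).astype(float)
toy_grad_fn = lambda song: song.mean(axis=0)

update = combined_gradient(toy_grad_fn, data, noise)
print("fraction of notes on in random songs:", noise.mean())
print("update direction shape:", update.shape)
```

With the placeholder gradient the update is negative for every note, which mirrors the intent of pushing probability mass away from songs where many notes are on at once.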